Abstract

An advantage of methods that base inference on a posterior distribution is that 
uncertainty quantification, in the form of credible regions, is readily obtained. Except
in well-specified situations, however, there is no guarantee that these credible
regions will be calibrated in the sense that they achieve the nominal frequentist 
coverage probability, even approximately. To overcome this difficulty, we propose 
a general strategy—applicable to Bayes, Gibbs, and variational Bayes posteriors, 
among others—that introduces an additional scalar tuning parameter to control 
the posterior spread, and we develop an algorithm that chooses this parameter so 
that the corresponding credible region achieves the nominal coverage probability. 
Simulation results demonstrate that the proposed algorithm yields highly efficient 
credible regions in a variety of applications compared to existing methods. 

Keywords and phrases: Bootstrap; coverage probability; Gibbs model; misspecified
model; variational Bayes.


1 Introduction 


An advantage of Bayesian and other more general Bayesian-like methods that base their
inference on a suitable posterior distribution is that uncertainty quantification, in the form
of credible regions for the unknown parameters, is readily available. For this uncertainty
quantification to be meaningful, it is common to require that the specified credibility level
agrees, at least approximately, with the frequentist coverage probability, i.e., that the 95%
credibility regions read off from the posterior are approximately 95% confidence regions.
In this case, we say that the posterior credible region is calibrated. For well-specified
Bayesian models, one often has a Bernstein-von Mises theorem available to justify a
calibration claim, but when the model is misspecified in at least one of several possible
ways, calibration often fails. For example, Kleijn and van der Vaart (2012) derived a
Bernstein-von Mises theorem for Bayesian posteriors under model misspecification, and
pointed out that, even if concentration target and rate are correct, misspecification can
still cause a lack of calibration; see page 362 in their paper and the discussion below. Similarly,
the commonly used variational Bayes posteriors (e.g., Jaakkola and Jordan 1997; Jordan
et al. 1999) often lack the desired calibration property, and correcting this is listed as one
of the important open problems in Blei et al. (2017).

To address this problem, we propose to introduce, to the given posterior, an additional
scalar tuning parameter, intended to control the spread of the posterior distribution. This
formulation is inspired by the literature on Gibbs posteriors, where data and parameter
of interest are connected via a loss function, instead of a likelihood; see, e.g., Bissiri
et al. (2016), Alquier et al. (2016), Zhang (2006), Jiang and Tanner (2008), and Syring
and Martin (2017). In such cases, a scale (or inverse temperature) parameter must be
specified to properly weight the information in the data relative to that in the prior, but
this ultimately boils down to tuning the Gibbs posterior spread. A similar formulation
can be carried out for other Bayesian-like models, not just Gibbs posteriors; see Section 2.
Having introduced an extra parameter into the posterior, we then propose to select this
tuning parameter such that the corresponding posterior credible regions are calibrated in
the sense described above, and we present an algorithm, based on bootstrap and other
Monte Carlo techniques, to implement this idea efficiently.

Similar questions about scaling posterior distributions to address one or more types
of model misspecification have been considered recently in the literature. In particular,
the strategies that have been proposed (e.g., Holmes and Walker 2017) include
hierarchical Bayes and loss/information matching, and Grünwald and Van Ommen (2016)
propose to choose the scale parameter to minimize a type of prediction risk, similar in
spirit to cross validation. These proposals are reasonable, but they do not provide any
guarantees that the uncertainty quantification coming from the corresponding posterior
distribution is meaningful. In contrast, our proposal here is designed specifically to make
the corresponding posterior credible regions calibrated, at least approximately. The
claimed calibration follows immediately from our construction, and the simulations
presented in Section 4, covering several different models and types of posteriors,
demonstrate the effectiveness of the proposed method.

The remainder of the paper is organized as follows. Section 2 sets our notation,
defines our modified posterior distribution, with an extra calibration parameter, and
explains the intuition behind our proposed approach. The general posterior calibration
algorithm is presented in Section 3 and we discuss its basic properties. Section 4 contains
several examples, including a Gibbs posterior in quantile regression, a misspecified Bayes
posterior in linear regression, and a variational Bayes posterior in a mixture model, and
Section 5 makes some concluding remarks.


2 Problem formulation 

Suppose we have data $Z^n = (Z_1, \ldots, Z_n)$ consisting of iid observations from a distribution
$P$; here, each $Z_i$ could be a vector or even a response-predictor variable pair, i.e., $Z_i =
(X_i, Y_i)$. The quantity of interest is a parameter $\theta^\star = \theta(P)$, a feature of the underlying
distribution $P$, taking values in $\Theta$. Consider the following general construction of a
posterior distribution for inference on $\theta$.

• Connect data $Z^n$ to a full set of parameters $\eta$ through either a statistical model for
$P$, as in Bayes or other likelihood-based settings, or a suitable loss function, as in
Gibbs or M-estimation settings.


• Introduce a prior $\Pi$ for the full parameter $\eta$, and a scale $\omega > 0$ to weight the
information about $\eta$ in the data with that in the prior.

• Combine the prior, scale, and likelihood/loss to get a posterior distribution for $\eta$.

• Integrate to get the corresponding marginal posterior for $\theta$, denoted by $\Pi_{n,\omega}$.


This general recipe includes both the Bayesian and Gibbs posterior procedures, as well
as variational Bayes, as we demonstrate in Section 4. It also covers classical empirical
Bayes or other posteriors based on data-dependent priors (e.g., Fraser et al. 2010; Hannig
et al. 2016; Martin and Walker 2016). The one technical requirement we have is that
the posterior $\Pi_{n,\omega}$ be consistent in the sense that it concentrates, asymptotically, on the
actual value $\theta(P)$ for each fixed $\omega$. Consistency must be verified case-by-case, but this is
standard; see Section 4. Given that the posterior $\Pi_{n,\omega}$ is approximately centered around
$\theta^\star$, the use of credible regions to quantify uncertainty is reasonable.

For concreteness, consider the problem of estimating the median $\theta^\star$ of a distribution
$P$; more complicated examples are presented in Section 4. The median can be defined as
the minimizer of the risk $R(\theta) = P\ell_\theta$, the expected value of the loss $\ell_\theta(z) = |z - \theta|$, under
$P$. This loss forms a connection between $Z^n$ and $\theta$, and a Gibbs posterior is defined as

$$\Pi_{n,\omega}(d\theta) \propto e^{-\omega n R_n(\theta)} \, \Pi(d\theta), \qquad (1)$$

where $R_n(\theta) = \mathbb{P}_n \ell_\theta = n^{-1} \sum_{i=1}^n |Z_i - \theta|$ is the empirical version of the risk, $\omega > 0$ is a
scale parameter, and $\Pi$ is a prior for $\theta$. As an alternative, a Bayesian might specify a
statistical model, such as Gamma$(\alpha, \beta)$, with likelihood $L_n(\eta)$ expressed in terms of
parameters $\eta = (\alpha, \beta)$, and a prior $\Pi$ for $\eta$, and define a (scaled) posterior for $\theta$ as

$$\Pi_{n,\omega}(A) \propto \int_{\{\eta : F_\eta^{-1}(1/2) \in A\}} L_n(\eta)^\omega \, \Pi(d\eta),$$

where $F_\eta$ denotes the Gamma$(\alpha, \beta)$ distribution function. The choice between these two
approaches, or variations thereof, depends largely on the willingness of the data analyst
to specify a full model as well as on the goals of the analysis; the Gibbs approach provides
inference on the median but nothing else, with minimal modeling assumptions, whereas
the Bayes approach provides inference on virtually any feature, but with higher modeling
and computational costs. In either case, the choice of scale $\omega$ is important.
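To make the role of the scale concrete, here is a minimal Python sketch of the Gibbs posterior for the median, evaluated on a grid. It assumes the posterior has the form $\exp\{-\omega n R_n(\theta)\}$ times the prior, with $R_n$ the average absolute deviation, and a flat prior by default. The function name `gibbs_posterior_median` and the grid-based normalization are illustrative choices, not part of the paper.

```python
import numpy as np

def gibbs_posterior_median(z, omega, grid, log_prior=None):
    """Gibbs posterior density for the median on an evenly spaced grid.

    Assumes density proportional to exp(-omega * n * R_n(theta)) * prior(theta),
    where R_n(theta) is the mean absolute deviation (flat prior if log_prior is None).
    """
    z = np.asarray(z, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Empirical risk R_n(theta) = mean_i |z_i - theta| at every grid point.
    risk = np.abs(z[None, :] - grid[:, None]).mean(axis=1)
    log_post = -omega * z.size * risk
    if log_prior is not None:
        log_post = log_post + log_prior(grid)
    log_post -= log_post.max()          # stabilize before exponentiating
    dens = np.exp(log_post)
    step = grid[1] - grid[0]
    return dens / (dens.sum() * step)   # normalize so the Riemann sum equals 1
```

Calling the function with a smaller `omega` flattens the exponent and therefore spreads the posterior, which is exactly the lever that the calibration step tunes.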

Our proposed choice of scale is based on calibrating the posterior credible regions
to be used for uncertainty quantification. Fix a level $\alpha \in (0,1)$ and, for concreteness,
consider the highest posterior density credible regions defined as

$$C_{\omega,\alpha}(Z^n) = \{\theta : \pi_{n,\omega}(\theta) > c_\alpha\}, \qquad (2)$$

where $\pi_{n,\omega}$ is the density function corresponding to the posterior $\Pi_{n,\omega}$ and $c_\alpha$ is a constant
chosen so that the $\Pi_{n,\omega}$-probability assigned to $C_{\omega,\alpha}(Z^n)$ is equal to $1 - \alpha$. The scale
parameter $\omega$ controls the spread of the posterior and, thereby, the size of these credible
regions. Our proposal is to choose $\omega$ so that the credible regions are of the appropriate
size to be calibrated, i.e., so that their coverage probability, $P\{C_{\omega,\alpha}(Z^n) \ni \theta(P)\}$, is
approximately equal to $1 - \alpha$; see Section 3.
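For a scalar parameter, a highest posterior density region can be approximated from a density evaluated on a grid. The sketch below is one simple way to do this, assuming an evenly spaced grid and a density that is already normalized; the helper name `hpd_region` is hypothetical.

```python
import numpy as np

def hpd_region(grid, dens, alpha=0.05):
    """Boolean mask of grid points in an approximate 100(1 - alpha)% HPD region.

    Assumes `grid` is evenly spaced and `dens` integrates to about 1 over it.
    """
    grid = np.asarray(grid, dtype=float)
    dens = np.asarray(dens, dtype=float)
    step = grid[1] - grid[0]
    order = np.argsort(dens)[::-1]               # grid points, highest density first
    mass = np.cumsum(dens[order]) * step         # accumulated posterior mass
    k = np.searchsorted(mass, 1 - alpha) + 1     # smallest set with mass >= 1 - alpha
    keep = np.zeros(grid.size, dtype=bool)
    keep[order[:k]] = True
    return keep
```

For a unimodal density the mask selects a single interval, whose endpoints are `grid[keep].min()` and `grid[keep].max()`.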

To better understand our proposal, recall that, in the classical setting of a well-
specified Bayesian model with suitable regularity, the credible region will be calibrated,
at least asymptotically, when $\omega = 1$. There has been recent interest in the misspecified
case and, in particular, Kleijn and van der Vaart (2012) showed that even if a Bernstein-
von Mises theorem holds, the posterior credible regions might not be calibrated. Roughly
speaking, misspecification affects the shape of the posterior contours, which may be the
wrong shape compared to the sampling distribution of the corresponding M-estimator.
Varying the scale parameter can provide a conservative solution to this problem: we can
stretch the posterior contours enough that they contain a differently-shaped but suitably
calibrated confidence region; see Figure 1 for an illustration and Remark 2 below.

Figure 1: Contours of the asymptotic distribution of the M-estimator (solid) and those
of the asymptotic Gibbs posterior (dashed). Left: $\omega = 1$; Right: $\omega = 1/4$.


3 Posterior calibration algorithm 

As discussed previously, our goal is to select the calibration parameter $\omega$ such that the
corresponding posterior credible regions are calibrated in the sense that the credibility
level agrees, at least approximately, with the coverage probability. To this end, for our
desired significance level $\alpha \in (0,1)$, and our preferred credible region $C_{\omega,\alpha}(Z^n)$ as in (2),
define the coverage probability function

$$c_\alpha(\omega \mid P) = P\{C_{\omega,\alpha}(Z^n) \ni \theta(P)\},$$

i.e., the $P$-probability that the credible region $C_{\omega,\alpha}(Z^n)$ contains the target $\theta(P)$. Then
calibration requires that $\omega$ be such that

$$c_\alpha(\omega \mid P) = 1 - \alpha, \qquad (3)$$

i.e., that the $100(1-\alpha)\%$ posterior credible region is also a $100(1-\alpha)\%$ confidence region.
Of course, in practice, we cannot solve this equation because we do not know $P$. The
approach described below is designed to get around this practical roadblock. Before we
proceed, note that solving (3) is a fixed-$n$ exercise, so our aim is to get exact calibration
in finite samples. Asymptotic approximations come into play, however, because $P$ is
unknown in real applications, but the numerical illustrations in Section 4 demonstrate
that we are, in fact, able to achieve exact calibration, at least in some cases.

To build up our intuition, start by assuming that $P$ is known; later we will switch to
the more realistic case of unknown $P$. Even in this unrealistic case, it is generally not
possible to solve for $\omega$ in (3) explicitly, so numerical methods are required. It is possible to
solve (3) via stochastic approximation (Blei et al. 2017; Kushner and Yin 2003; Robbins
and Monro 1951) by iterating according to the rule

$$\omega^{(t+1)} = \omega^{(t)} + \kappa_t \{\hat c_\alpha(\omega^{(t)} \mid P) - (1 - \alpha)\}, \quad t \geq 0, \qquad (4)$$

where $\hat c_\alpha(\omega \mid P)$ is a Monte Carlo approximation to the coverage probability, obtained
by simulating new copies of the data from $P$, and $(\kappa_t)$ is a non-stochastic sequence
such that $\sum_t \kappa_t = \infty$ and $\sum_t \kappa_t^2 < \infty$. For our numerical results in Section 4, we use
a polynomially decaying sequence of the form $\kappa_t = (t+1)^{-r}$, with exponent $r \in (1/2, 1]$
so that both conditions hold. In this case, if $c_\alpha(\omega \mid P)$ is continuous and monotone
decreasing in $\omega$, both very reasonable assumptions, then it follows from the almost
supermartingale convergence theorem of Robbins and Siegmund (1971) that
$\omega^{(t)} \to \omega^\star$ $P$-almost surely, as $t \to \infty$, where $\omega^\star$ is the solution to (3).
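The known-$P$ iteration can be illustrated in a toy case where everything is available in closed form. The sketch below assumes $P = \mathsf{N}(0,1)$, squared-error loss, and a flat prior, so that a posterior proportional to $\exp\{-\omega \sum_i (Z_i - \theta)^2\}$ is normal with mean $\bar Z$ and variance $1/(2\omega n)$, and the calibrated scale is $\omega = 1/2$. The step-size exponent `r = 0.75`, the positivity clip, and all function names are assumptions made for illustration, not settings taken from the paper.

```python
import numpy as np
from statistics import NormalDist

def mc_coverage(omega, n, alpha, n_sim, rng):
    """Monte Carlo coverage of zbar +/- z_crit / sqrt(2 * omega * n) when P = N(0, 1)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    zbar = rng.standard_normal((n_sim, n)).mean(axis=1)   # fresh data sets drawn from P
    half_width = z_crit / np.sqrt(2 * omega * n)
    return np.mean(np.abs(zbar - 0.0) <= half_width)      # target theta(P) = 0

def robbins_monro(omega0, n=50, alpha=0.05, n_iter=300, n_sim=1000, r=0.75, seed=0):
    """Stochastic approximation: omega <- omega + kappa_t * (coverage - (1 - alpha))."""
    rng = np.random.default_rng(seed)
    omega = omega0
    for t in range(n_iter):
        kappa = (t + 1) ** (-r)                           # assumed step-size sequence
        omega += kappa * (mc_coverage(omega, n, alpha, n_sim, rng) - (1 - alpha))
        omega = max(omega, 1e-6)                          # added safeguard: keep omega > 0
    return omega
```

Starting from `omega0 = 1.0`, the interval is too narrow, the estimated coverage falls below the target, and the update moves `omega` down toward one half.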

For the realistic case where $P$ is unknown, the proposed approach changes in two
ways. First, since it is not possible to sample new copies of $Z^n$ from $P$, we replace
simulation from $P$ with simulation from the empirical distribution $\mathbb{P}_n$, i.e., we sample
with replacement from the observed data $Z^n$. Second, since we also do not know $\theta(P)$,
we cannot check if a given credible region $C_{\omega,\alpha}(Z^n)$ covers it. Instead, we use $\theta(\mathbb{P}_n)$ in
place of $\theta(P)$. This results in an empirical version of the coverage probability
$c_\alpha(\omega \mid P)$, namely,

$$c_\alpha(\omega \mid \mathbb{P}_n) = \mathbb{P}_n\{C_{\omega,\alpha}(\tilde Z^n) \ni \theta(\mathbb{P}_n)\}, \qquad (5)$$

where $\tilde Z^n$ denotes a sample of size $n$ drawn with replacement from $Z^n$, and our proposal
is to find $\omega$ such that

$$c_\alpha(\omega \mid \mathbb{P}_n) = 1 - \alpha. \qquad (6)$$

In practice, we cannot evaluate $c_\alpha(\omega \mid \mathbb{P}_n)$ either, but bootstrap will provide a Monte
Carlo estimator, which we denote by $\hat c_\alpha(\omega \mid \mathbb{P}_n)$. We can proceed to solve (6) by using
the same stochastic approximation procedure described above for the known-$P$ case.
Collectively, these steps to solve this equation make up our general posterior calibration
(GPC) algorithm. An R code implementation for each of the examples in Section 4 is
available at https://github.com/nasyring/GPC.
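The steps above (bootstrap resampling, empirical coverage, stochastic approximation update) can be sketched in Python for a toy case. This is not the authors' R implementation. It assumes a Gibbs posterior for the mean under squared-error loss with a flat prior, for which the credible interval from a bootstrap sample is its mean plus or minus $z_{1-\alpha/2}/\sqrt{2\omega n}$, so no posterior sampling is needed. The name `gpc_gibbs_mean`, the step-size exponent `r`, the iteration cap, and the positivity clip are all illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

def gpc_gibbs_mean(z, alpha=0.05, omega0=1.0, B=200, eps=0.01, r=0.75,
                   max_iter=5000, seed=0):
    """Toy posterior calibration for a Gibbs posterior on the mean (squared-error loss)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    n = z.size
    theta_hat = z.mean()                          # theta(P_n), stands in for theta(P)
    idx = rng.integers(0, n, size=(B, n))         # B bootstrap samples, drawn once
    boot_means = z[idx].mean(axis=1)              # interval centers, one per bootstrap sample
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    def coverage(omega):
        # Fraction of bootstrap credible intervals that contain theta(P_n).
        half_width = z_crit / np.sqrt(2 * omega * n)
        return np.mean(np.abs(boot_means - theta_hat) <= half_width)

    omega = omega0
    for t in range(max_iter):
        cov = coverage(omega)
        if abs(cov - (1 - alpha)) < eps:          # stopping rule on empirical coverage
            break
        # Stochastic approximation update with assumed step size (t + 1)^(-r).
        omega = max(omega + (t + 1) ** (-r) * (cov - (1 - alpha)), 1e-6)
    return omega, coverage(omega)
```

With only `B` bootstrap samples the empirical coverage moves in steps of `1/B`, so the tolerance `eps` should not be set below that resolution. For posteriors without closed-form credible regions, the `coverage` helper would instead have to run a posterior sampler on each bootstrap sample.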


Remark 1. In most applications, the credible regions $C_{\omega,\alpha}(Z^n)$ will not be available in
closed form, so posterior sampling will be needed in such cases. But despite having
several moving parts (bootstrap, MCMC, and stochastic approximation), the proposed
GPC algorithm is relatively inexpensive computationally. For example, in the quantile
regression problem in Section 4.1, with a two-dimensional parameter, sample size $n = 100$,
$B = 200$ bootstrap samples, and $M = 2000$ posterior samples, the algorithm took less
than 10 seconds to converge on a Windows desktop computer with a 4.0 GHz Intel Core
i7 processor. We believe that a minimal extra computational investment is a fair trade
for calibrated posterior credible regions.


Algorithm 1: General Posterior Calibration.

Fix a convergence tolerance $\varepsilon > 0$ and an initial guess $\omega^{(0)}$ of the calibration parameter.
Take $B$ bootstrap samples $\tilde Z^n_1, \ldots, \tilde Z^n_B$ of size $n$. Set $t = 0$ and do:

1. Construct credible regions $C_{\omega^{(t)},\alpha}(\tilde Z^n_b)$ for each $b = 1, \ldots, B$.

2. Evaluate the empirical coverage $\hat c_\alpha(\omega^{(t)} \mid \mathbb{P}_n)$ as in (5).

3. If $|\hat c_\alpha(\omega^{(t)} \mid \mathbb{P}_n) - (1 - \alpha)| < \varepsilon$, then stop and return $\omega^{(t)}$ as the output; otherwise,
update $\omega^{(t)}$ to $\omega^{(t+1)}$ according to (4), set $t \leftarrow t + 1$, and go back to Step 1.


Remark 2. The workhorses of the GPC algorithm, namely, bootstrap, stochastic
approximation, and MCMC, are widely used and, on their own, theoretically sound. But having
them all working together in tandem makes a general theoretical analysis of the GPC
algorithm very challenging. Here, we provide only some technical insight as to why the
algorithm works, leaving a complete theoretical analysis for future research.

Let the interest parameter be defined via a risk function $R(\theta) = P\ell_\theta$, so that $\theta(P) =
\arg\min_\theta R(\theta)$. In this case, a Gibbs posterior $\Pi_{n,\omega}$ is defined as in (1), where $R_n(\theta) =
\mathbb{P}_n \ell_\theta$ is the empirical risk. Under suitable regularity conditions, the Gibbs posterior
will resemble a normal distribution, centered at $\theta(\mathbb{P}_n) = \arg\min_\theta R_n(\theta)$, with asymptotic
covariance matrix $\omega^{-1}\Sigma_n$, where $\Sigma_n = n^{-1} V_{\theta(P)}^{-1}$ and $V_\theta$ is the second derivative matrix
of $R(\theta)$. So, the credible region $C_{\omega,\alpha}(Z^n)$ will look roughly like

$$\{\theta : \omega\,(\theta - \theta(\mathbb{P}_n))^\top \Sigma_n^{-1} (\theta - \theta(\mathbb{P}_n)) < \delta_\alpha\},$$

where $\delta_\alpha$ is the appropriate chi-square quantile. On the other hand, the asymptotic
covariance matrix of $\theta(\mathbb{P}_n)$ is given by $T_n = n^{-1} V_{\theta(P)}^{-1} M V_{\theta(P)}^{-1}$, where $M = P\dot\ell_{\theta(P)}\dot\ell_{\theta(P)}^\top$ and
$\dot\ell_\theta$ is the derivative of $\theta \mapsto \ell_\theta$, so an asymptotic confidence region is

$$\{\theta : (\theta - \theta(\mathbb{P}_n))^\top T_n^{-1} (\theta - \theta(\mathbb{P}_n)) < \delta_\alpha\}.$$

In general, $\Sigma_n$ and $T_n$ are different, so the credible region has a different shape than
the confidence region and, therefore, may not be calibrated. But the GPC algorithm will
take $\omega$ roughly equal to the smallest eigenvalue of $T_n^{-1}\Sigma_n$, so that the credible
region contains the aforementioned confidence region, which makes the credible region
conservatively calibrated; exact calibration is possible only if $T_n = c\Sigma_n$ for some scalar $c > 0$. Since
$T_n^{-1}\Sigma_n$ has a limit as $n \to \infty$, we expect that our GPC solution to (6) will converge
to the smallest eigenvalue of that limiting matrix, hence asymptotic calibration.
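The limiting scale described in this remark can be computed numerically once the two matrices are known. The sketch below assumes the asymptotic posterior covariance is $\omega^{-1}\Sigma_n$ with $\Sigma_n = n^{-1}V^{-1}$ and the sandwich covariance is $T_n = n^{-1}V^{-1}MV^{-1}$, so the factors of $n$ cancel in $T_n^{-1}\Sigma_n$. The function name `limiting_scale` is hypothetical.

```python
import numpy as np

def limiting_scale(V, M):
    """Smallest eigenvalue of T^{-1} Sigma, with Sigma = V^{-1} and T = V^{-1} M V^{-1}.

    V is the second-derivative matrix of the risk and M the second-moment matrix
    of the loss derivative; scalars are accepted and treated as 1 x 1 matrices.
    """
    V = np.atleast_2d(np.asarray(V, dtype=float))
    M = np.atleast_2d(np.asarray(M, dtype=float))
    V_inv = np.linalg.inv(V)
    Sigma = V_inv                      # n * Sigma_n
    T = V_inv @ M @ V_inv              # n * T_n
    eig = np.linalg.eigvals(np.linalg.inv(T) @ Sigma)
    return float(np.min(eig.real))
```

For squared-error loss $\ell_\theta(z) = (z - \theta)^2$ and unit-variance data, $V = 2$ and $M = 4$, so `limiting_scale(2.0, 4.0)` gives one half; for a correctly specified unit-variance normal likelihood, $V = M = 1$ and the scale is one.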

For a quick proof-of-concept, suppose the data are iid $P$ and the population mean
$\theta = \int z \, dP(z)$ is the quantity of interest. Here we take $P = \mathsf{N}(0,1)$, so that $\theta = 0$. We
consider three posterior distributions: a Bayes model using the correct normal likelihood;
a Gibbs posterior using $R_n(\theta) = n^{-1}\sum_{i=1}^n (Z_i - \theta)^2$; and a misspecified Bayesian posterior
with a Laplace likelihood. For the well-specified Bayes and the Gibbs posteriors, we
expect the GPC algorithm to select $\omega \approx 1$ and $\omega \approx 0.5$, respectively; for the Laplace
model, based on the $V_\theta$ and $M$ calculations in Remark 2, we expect $\omega \approx 0.64$. Figure 2
plots the mean trajectories of the $\omega$ values obtained from our algorithm, with error bars,
as a function of $n$. These results confirm our expectations based on theory.

Figure 2: Mean choice of $\omega$ over 100 simulated standard normal data sets of sizes 100,
250, 500, and 1000 using the true likelihood, a Gibbs model, and a Laplace likelihood.
Vertical bars represent two standard deviations from the mean.


4 Applications 

4.1 Quantile regression 

In quantile regression, for fixed $\tau \in (0,1)$, we are interested in the $\tau$-th quantile of the
response $Y \in \mathbb{R}$, given the covariates $X \in \mathbb{R}^{p+1}$, expressed as

$$Q_\tau(Y \mid X) = X^\top \theta, \qquad (7)$$

where dimension $p + 1$ represents an intercept and $p$ covariates. In this formula, the vector
$\theta$ depends on $\tau$ but, for notational simplicity, we will omit this dependence. This model
specifies no parametric form for the conditional distribution of $Y$ given $X$. Inference on
the quantile regression coefficient $\theta$ may be carried out using asymptotic approximations
(Koenker 2005, Theorem 4.1) or by using the bootstrap (Horowitz 1998). A Bayesian
approach would also be attractive, but no distributional form for the conditional
distribution is given in (7), hence no likelihood. A workaround that has been considered by
several authors (e.g., Sriram 2015; Sriram et al. 2013; Yu and Moyeed 2001) is to use
a (misspecified) asymmetric Laplace likelihood. This corresponds to a Gibbs model (1)
using the empirical risk

$$R_n(\theta) = \frac{1}{n} \sum_{i=1}^n \bigl| (Y_i - X_i^\top \theta)\,(\tau - I_{(Y_i - X_i^\top \theta < 0)}) \bigr|, \qquad (8)$$

based on the usual check-loss function, where $Z_i = (X_i, Y_i)$, $i = 1, \ldots, n$, are the
observations, and $I$ is the indicator function.

It follows from Kleijn and van der Vaart (2012) that the Gibbs posterior based on (8)
satisfies a Bernstein-von Mises theorem. Despite the desirable convergence result, the
variance mismatch discussed above causes the credible regions to be too large and
over-cover, a sign of inefficiency. On the other hand, the GPC algorithm calibrates the
intervals exactly, for all $n$, without loss of efficiency in terms of interval lengths.
                        Coverage Probability                            Average Length
n     Parameter    BEL.s  BDL   Normal  $\omega = 0.8$  GPC      BEL.s  BDL   Normal  $\omega = 0.8$  GPC
100   $\theta_0$   0.97   0.98  0.95    0.96            0.95     1.06   1.11  1.00    1.00            0.91
      $\theta_1$   0.98   0.98  0.98    0.98            0.95     0.58   0.58  0.55    0.52            0.47
400   $\theta_0$   0.95   0.98  0.95    0.95            0.95     0.50   0.55  0.50    0.49            0.46
      $\theta_1$   0.97   0.98  0.97    0.96            0.95     0.26   0.28  0.25    0.25            0.23
1600  $\theta_0$   0.96   0.97  0.96    0.95            0.95     0.25   0.28  0.25    0.24            0.23
      $\theta_1$   0.96   0.98  0.96    0.96            0.95     0.13   0.14  0.12    0.12            0.11

Table 1: Comparison of 95% posterior credible intervals of the median regression
parameters from five methods: BEL.s; BDL; Normal, the confidence interval computed using
the asymptotic normality of the M-estimator; $\omega = 0.8$, the scaled posterior with $\omega$ fixed
equal to 0.8; and GPC. Coverage probability and average interval lengths are computed
over 5000 simulated data sets for our method, normal intervals, and fixed-$\omega$ intervals.
Results for BEL.s and BDL are taken from Yang and He (2012) and were calculated from
1000 simulated data sets.


To demonstrate this, we revisit a simulation example presented in Yang and He (2012).
For $\tau = 0.5$, the model they consider is

$$Y_i = \theta_0 + \theta_1 X_i + e_i, \quad i = 1, \ldots, n,$$

where $\theta_0 = 2$, $\theta_1 = 1$, $e_i \sim \mathsf{N}(0, 4)$, and $X_i \sim \mathsf{ChiSq}(2) - 2$. For this model, the
authors showed numerically that their proposed Bayesian empirical likelihood approach
(“BEL.s”) produced credible intervals with approximate coverage near the nominal 95%
level. Moreover, compared to the Bayesian method with misspecified asymmetric Laplace
likelihood (“BDL”) or, equivalently, our posterior with $\omega$ chosen by averaging residuals,
their method is shown to be more efficient in terms of interval length. The results for
these methods are presented in Table 1 along with the results from the posterior intervals
scaled by the GPC algorithm.
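For readers who want to reproduce the setup, data from this model can be generated as in the sketch below. It assumes the model $Y_i = \theta_0 + \theta_1 X_i + e_i$ with $\theta_0 = 2$ and $\theta_1 = 1$, and it reads $\mathsf{N}(0, 4)$ as variance 4 (standard deviation 2); the function name `simulate_median_regression` is hypothetical.

```python
import numpy as np

def simulate_median_regression(n, rng):
    """Draw (x, y) with y = 2 + x + e, x ~ ChiSq(2) - 2, e ~ N(0, variance 4)."""
    x = rng.chisquare(2, size=n) - 2.0     # centered chi-square covariate
    e = rng.normal(0.0, 2.0, size=n)       # standard deviation 2, i.e., variance 4
    y = 2.0 + 1.0 * x + e
    return x, y
```

A design matrix for the median regression fit is then `np.column_stack([np.ones_like(x), x])`.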

There are two key observations to be made. First, our method calibrates the credible
intervals to have exact 95% coverage across the range of $n$, while the other methods tend
to over-cover. Second, our credible intervals tend to be shorter than those of the other
methods, especially for $n = 100$. All three methods have an $n^{-1/2}$ convergence rate so,
for large $n$, we cannot expect to see substantial differences between the various methods.
Therefore, the small-$n$ case should be the most important and, at least in this case, the
credible intervals calibrated using our algorithm are clearly the best.

Finally, considering that in smooth models we expect $\omega$ to account for the difference
in asymptotic variance between the posterior and the M-estimator, it is reasonable to ask
if we need a calibration algorithm at all, i.e., can we get by with a fixed value of $\omega$ based
on these asymptotic variances? A comparison of the asymptotic variance of the posterior
with that of the M-estimator shows that $T_n^{-1} \approx 0.80\,\Sigma_n^{-1}$; therefore, we can take $\omega = 0.80$
in an attempt to calibrate posterior credible intervals with a fixed scaling. Table 1 shows
that our algorithm is still better than using a fixed scale based on asymptotic normality,
especially at smaller sample sizes where the normal approximation is less justifiable.
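As a worked check of the value 0.80, the following sketch assumes the sandwich quantities take their standard form for the check loss when the errors are independent of the covariates. The symbols $f_e$ (the error density) and $Q = E(XX^\top)$ are introduced here for illustration only.

```latex
\begin{align*}
V &= f_e(0)\, Q, \qquad M = \tau(1-\tau)\, Q, \qquad Q = E(X X^\top), \\
T_n^{-1} \Sigma_n &= (n\, V M^{-1} V)\,(n^{-1} V^{-1}) = V M^{-1}
   = \frac{f_e(0)}{\tau(1-\tau)}\, I, \\
f_e(0) &= \frac{1}{2\sqrt{2\pi}} \approx 0.1995 \quad \text{for } e \sim \mathsf{N}(0, 4),
   \qquad \tau(1-\tau) = 0.25 \quad \text{for } \tau = 0.5, \\
\omega &\approx \frac{0.1995}{0.25} \approx 0.80 .
\end{align*}
```

Because $V M^{-1}$ is a multiple of the identity here, the credible and confidence regions have the same shape, and a single scalar can match them exactly in the limit.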
4.2 Linear regression

Consider the usual multiple linear regression model for data $(X_i, Y_i) \in \mathbb{R}^p \times \mathbb{R}$,

$$Y_i = \beta_0 + X_i^\top \beta + \sigma e_i, \quad i = 1, \ldots, n, \qquad (9)$$

where $\beta \in \mathbb{R}^p$ is the vector of slope coefficients, $\sigma > 0$ is an unknown scale parameter, and
$e_1, \ldots, e_n$ are assumed to be iid $\mathsf{N}(0,1)$. Suppose, however, that the constant error
variance assumption is violated, in particular, $e_i \sim \mathsf{N}(0, \sigma^2\|X_i\|)$, $i = 1, \ldots, n$, independent.
Our choice of predictor-dependent variance is a less-stylized version of that in Grünwald
and Van Ommen (2016). The proposed model is, therefore, misspecified, but our goal is
still to obtain calibrated inference on $\theta = (\beta_0, \beta)$.


The Jeffreys prior is a reasonable default choice with density $\pi(\eta) \propto (\sigma^2)^{-1}$ (Ibrahim
and Laud 1991) for the full parameter $\eta = (\theta, \sigma^2)$. Since this prior is probability-matching
for the location-scale model (e.g., Datta and Mukerjee 2004), we may expect that the
posterior credible intervals would be approximately calibrated for our linear regression.
However, for a misspecified model, calibration might fail; in fact, as shown in Table 2,
the credible intervals are too narrow and tend to undercover.

To investigate the performance of our proposed posterior calibration method compared
to several others, we carry out a simulation study. We simulated data sets of $n = 50$
observations. Each $X_i \in \mathbb{R}^3$ is multivariate normal with zero mean and unit variance for
each element, and pairwise correlation 0.5 for $X_{i1}$ and $X_{i2}$ and zero otherwise. To sample
$Y_i$ we use $\beta_0 = 0$, $\beta = (1, 2, -1)^\top$, and $\sigma = 1$. Although the error variance contains
$\|X_i\|$, the regular tests for constant variance do not detect the heteroscedasticity. Table 2
shows the estimated coverage probability and mean lengths of several posterior credible
intervals for the components of $\theta$. Besides those scaled by the GPC algorithm, we consider
a misspecified Bayes approach that fixes $\omega = 1$, and posteriors with scale $\omega$ chosen by
the method in Holmes and Walker (2017) and the SafeBayes method in Grünwald and
Van Ommen (2016, Algorithm 1). The results in Table 2 show that for this example
SafeBayes performs similarly to GPC, while the method in Holmes and Walker (2017)
does not improve upon the misspecified Bayesian model in terms of calibration.
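The simulation design described above can be sketched as follows. The code assumes three covariates, takes $\|X_i\|$ to be the Euclidean norm, and gives the error term variance $\sigma^2\|X_i\|$; the function name `simulate_heteroscedastic` and its defaults are illustrative.

```python
import numpy as np

def simulate_heteroscedastic(n, rng, beta0=0.0, beta=(1.0, 2.0, -1.0), sigma=1.0):
    """Linear regression data whose error variance is sigma^2 * ||X_i|| (Euclidean norm)."""
    cov = np.array([[1.0, 0.5, 0.0],
                    [0.5, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])      # unit variances, correlation 0.5 for first pair
    X = rng.multivariate_normal(np.zeros(3), cov, size=n)
    sd = sigma * np.sqrt(np.linalg.norm(X, axis=1))   # predictor-dependent standard deviation
    y = beta0 + X @ np.asarray(beta) + sd * rng.standard_normal(n)
    return X, y
```

Least squares still recovers the coefficients on average under this design; it is the interval widths from the constant-variance model that are affected.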

Figure 3 shows a boxplot comparison of the scale parameters chosen by the three
posterior scaling methods for the misspecified Bayesian posterior. Our algorithm, along
with the SafeBayes method, tends to produce smaller values of $\omega$ than the method of
Holmes and Walker. Small values of $\omega$ mean higher posterior variance and wider credible
intervals, which explains these methods' improvement in calibration. While both our
algorithm and SafeBayes pick $\omega \approx 0.8$ on average, the distribution of $\omega$ is much more
concentrated using our algorithm.


4.3 Variational inference for a normal mixture model 

Variational inference offers a competing method to Markov chain Monte Carlo for
approximating the posterior distribution. This approach specifies a family of distributions (often
a normal family) as candidate posteriors and then chooses the parameters of that family
to minimize the Kullback-Leibler divergence from the true posterior. The variational
posterior is simple by construction and, if carefully chosen, will be consistent (e.g., Wang
and Titterington 2005), but as noted in Blei et al. (2017), misspecification causes the
variational posterior variance to be too small.
             Misspecified Bayes      GPC                     SafeBayes               Holmes and Walker
             coverage  length        coverage  length        coverage  length        coverage  length
$\beta_0$    0.94      0.99 (0.15)   0.98      1.17 (0.18)   0.96      1.19 (0.26)   0.91      0.87 (0.18)
$\beta_1$    0.89      1.16 (0.20)   0.94      1.36 (0.23)   0.93      1.40 (0.31)   0.84      1.01 (0.22)
$\beta_2$    0.88      1.16 (0.20)   0.94      1.36 (0.24)   0.94      1.39 (0.33)   0.80      1.01 (0.22)
$\beta_3$    0.87      1.01 (0.17)   0.93      1.18 (0.20)   0.92      1.21 (0.28)   0.82      0.87 (0.18)

Table 2: Empirical coverage probabilities of 95% credible intervals and average interval
lengths (and standard deviations) calculated using 5000 simulations from the model
described in Section 4.2.


Figure 3: Boxplots of $\omega$ for the model described in Section 4.2 using GPC, SafeBayes
(Grünwald and Van Ommen 2016), and the method in Holmes and Walker (2017) over
5000 simulated data sets.


As an example, we consider the normal mixture model presented in Blei et al. (2017),
i.e., $Y_1, \ldots, Y_n$ are iid observations from the mixture model

$$\sum_{k=1}^K \pi_k \, \mathsf{N}(\mu_k, \sigma_k^2). \qquad (10)$$

The full parameter $\eta$ consists of the mixture weights $(\pi_1, \ldots, \pi_K)$, means $(\mu_1, \ldots, \mu_K)$,
and variances $(\sigma_1^2, \ldots, \sigma_K^2)$, but we will consider inference only on the means. We can
construct a variational posterior for $\eta$ following Algorithm 2 in Blei et al. (2017), which
approximates the posterior by a multivariate normal. The additional scale factor $\omega$ in our
modified variational posterior $\Pi_{n,\omega}$ only adjusts the overall scale of this multivariate
normal. Therefore, if $m_1, \ldots, m_K$ and $v_1, \ldots, v_K$ are the means and variances, respectively,
of this variational posterior for the mixture means $\mu_1, \ldots, \mu_K$, then the corresponding
$\omega$-scaled variational posterior $100(1 - \alpha)\%$ credible intervals are of the form

$$m_k \pm z_{\alpha/2}\,(v_k / \omega)^{1/2}, \quad k = 1, \ldots, K,$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.
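Because the scaled variational intervals are available in closed form, computing them for a candidate scale is a one-liner. The sketch below assumes the scale enters by dividing each variational variance by $\omega$, so the intervals are $m_k \pm z_{\alpha/2}(v_k/\omega)^{1/2}$; the function name `scaled_vb_intervals` is hypothetical, and the inputs `m` and `v` would come from whatever variational fit is used.

```python
import numpy as np
from statistics import NormalDist

def scaled_vb_intervals(m, v, omega, alpha=0.05):
    """Intervals m_k +/- z * sqrt(v_k / omega) from variational means m and variances v."""
    m = np.asarray(m, dtype=float)
    v = np.asarray(v, dtype=float)
    z = NormalDist().inv_cdf(1 - alpha / 2)       # upper alpha/2 standard normal quantile
    half = z * np.sqrt(v / omega)                 # omega < 1 widens, omega > 1 narrows
    return np.column_stack([m - half, m + half])  # one row (lower, upper) per component
```

Inside a calibration loop, this function would be applied to the variational fit from each bootstrap sample, and the fraction of intervals covering the full-data estimate gives the empirical coverage for that value of `omega`.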
                  $\mu_1$        $\mu_2$
GPC   coverage    0.96           0.96
      length      0.67 (0.08)    0.67 (0.08)
VI    coverage    0.92           0.92
      length      0.55 (0.03)    0.55 (0.03)

Table 3: Empirical coverage probability and average length (standard deviation) of the
credible intervals for $(\mu_1, \mu_2)$ based on our GPC algorithm and the variational posterior
(VI) in Blei et al. (2017) over 5000 simulated data sets from the mixture model (10).


It is straightforward to incorporate this variational posterior setup into our GPC
algorithm; the computational investment is in carrying out the optimization needed for the
variational approximation at each bootstrap step, but then the credible intervals are
available in closed form so no posterior sampling is needed.

We claim that the GPC algorithm will properly scale the variational posterior,
calibrating the corresponding credible intervals, correcting the under-estimation of variance
noted in Blei et al. (2017). To demonstrate this, we carry out a simple simulation study.
We take $K = 2$, $\pi_1 = \pi_2 = 1/2$, $(\mu_1, \mu_2) = (-2, 2)$, and $\sigma_1 = \sigma_2 = 1$. Table 3 shows
the empirical coverage probabilities and mean lengths of the 95% credible intervals based
on Algorithm 2 in Blei et al. (2017) and our GPC algorithm. Apparently, our GPC
algorithm corrects the underestimated variance of the variational posterior, producing
credible intervals that are slightly conservative.


5 Discussion 

The sensitivity of Bayesian credible sets to the posited probability model makes obtaining
calibrated inference a challenging problem. Our linear regression example demonstrates
this sensitivity when we take the model for granted. However, misspecification can happen
in a variety of settings, and not always unintentionally. In quantile regression, the model
is determined by a risk function rather than a likelihood, making traditional Bayesian
inference using the true likelihood elusive. And, other times, computational considerations
make variational posteriors an attractive alternative to a fully Bayesian analysis. Our
posterior calibration algorithm may provide a solution in all of these settings by correcting
for model misspecification to produce, at least approximately, calibrated inferences.

Although the focus in this paper is on misspecified models, it may still be desirable
to apply our algorithm even when the true likelihood is used. The reason is that our
algorithm can aid in producing calibrated inferences for the given sample size, regardless
of the prior distribution used. This facilitates the use of informative priors, if available,
instead of default priors, while still gaining the desired calibration property.

Finally, while it is clear that the GPC algorithm produces approximately calibrated
credible sets, a detailed theoretical study is needed. The techniques we have used
(stochastic approximation, MCMC, and bootstrap) are each theoretically sound on their
own, but very complicated when used in tandem. Further work in this direction may help
provide guidance, but the lack of completely rigorous theory does not take away from the
encouraging examples shown throughout the paper.